Transcriptome Data Analysis - hands-on session
@FedeBioinfoIn this tutorial we walk through a gene-level RNA-seq differential expression analysis using Bioconductor packages. We start from the gene-vs-sample count matrix, and thus assume that the raw reads have already been quality controlled and that the gene expression has been quantified (either using alignment and counting, or by applying an alignment-free quantification tool). We perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, then perform differential gene expression analysis, and visually explore the results.
Bioconductor has many packages supporting analysis of high-throughput sequence
data, including RNA-seq. The packages that we will use in this tutorial include
core packages maintained by the Bioconductor core
team for importing and
processing raw sequencing data and loading gene annotations. We will also use
contributed packages for statistical analysis and visualization of sequencing
data. Through scheduled releases every 6 months, the Bioconductor project
ensures that all the packages within a release will work together in harmony
(hence the “conductor” metaphor). The packages used in this tutorial are loaded
with the library function and can be installed by following the Bioconductor
package installation
instructions.
If you use the results from an R package in published research, you can find the
proper citation for the software by typing citation("pkgName"), where you
would substitute the name of the package for pkgName. Citing methods papers
helps to support and reward the individuals who put time into open source
software for genomic data analysis.
Many parts of this tutorial are based on a published RNA-seq workflow available via F1000Research (Love et al. 2015) and as a Bioconductor package.
The data used in this workflow comes from an RNA-seq experiment (Alasoo et al. 2018), in which the authors identified shared quantitative trait loci (QTLs) for chromatin accessibility and gene expression in human macrophages exposed to IFNgamma, Salmonella and IFNgamma plus Salmonella. Processed data from a subset of 24 samples from this experiment (six female donors, with four treatments each) is available via the macrophage R/Bioconductor package. In particular, the package contains output from Salmon (Patro et al. 2017), as well as a metadata file. More information about how the raw data was processed is available from the package vignette.
We start by setting the path to the folder containing the quantifications
(the output folders from Salmon). Since these are provided with an R package,
we will point to the extdata subfolder of the installed package. For a typical
analysis of your own data, you would point directly to a folder on your storage
system (i.e., not using system.file()).
library(macrophage)
dir <- system.file("extdata", package = "macrophage")
list.files(dir)
# [1] "coldata.csv" "errs"
# [3] "gencode.v29_salmon_0.12.0" "gencode.v29.annotation.gtf.gz"
# [5] "PRJEB18997.txt" "quants"
# [7] "supp_table_1.csv" "supp_table_7.csv"
TODO: open up that directory to “better see what is in it”? Idea: it can give them a sense for the what do I expect to get there?
First, we will read the metadata for the experiment. The main annotations of
interest for this tutorial are condition_name, which represents the
treatment of the sample (naive, IFN gamma, Salmonella, IFN gamma+Salmonella) and
line_id, which represents the donor ID. The
sample identifier is given by the names column, and will be used to match the
metadata table to the quantifications.
coldata <- read.csv(file.path(dir, "coldata.csv"))[, c(1, 2, 3, 5)]
dim(coldata)
# [1] 24 4
coldata
# names sample_id line_id condition_name
# 1 SAMEA103885102 diku_A diku_1 naive
# 2 SAMEA103885347 diku_B diku_1 IFNg
# 3 SAMEA103885043 diku_C diku_1 SL1344
# 4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
# 5 SAMEA103885182 eiwy_A eiwy_1 naive
# 6 SAMEA103885136 eiwy_B eiwy_1 IFNg
# 7 SAMEA103885413 eiwy_C eiwy_1 SL1344
# 8 SAMEA103884967 eiwy_D eiwy_1 IFNg_SL1344
# 9 SAMEA103885368 fikt_A fikt_3 naive
# 10 SAMEA103885218 fikt_B fikt_3 IFNg
# 11 SAMEA103885319 fikt_C fikt_3 SL1344
# 12 SAMEA103885004 fikt_D fikt_3 IFNg_SL1344
# 13 SAMEA103885284 ieki_A ieki_2 naive
# 14 SAMEA103885059 ieki_B ieki_2 IFNg
# 15 SAMEA103884898 ieki_C ieki_2 SL1344
# 16 SAMEA103885157 ieki_D ieki_2 IFNg_SL1344
# 17 SAMEA103885111 podx_A podx_1 naive
# 18 SAMEA103884919 podx_B podx_1 IFNg
# 19 SAMEA103885276 podx_C podx_1 SL1344
# 20 SAMEA103885021 podx_D podx_1 IFNg_SL1344
# 21 SAMEA103885262 qaqx_A qaqx_1 naive
# 22 SAMEA103885228 qaqx_B qaqx_1 IFNg
# 23 SAMEA103885308 qaqx_C qaqx_1 SL1344
# 24 SAMEA103884949 qaqx_D qaqx_1 IFNg_SL1344
In addition to the names column, tximeta, which we will use to read the
quantification data, requires that coldata has a column named files,
pointing to the Salmon output (the quant.sf file) for the respective samples.
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
head(coldata)
# names sample_id line_id condition_name
# 1 SAMEA103885102 diku_A diku_1 naive
# 2 SAMEA103885347 diku_B diku_1 IFNg
# 3 SAMEA103885043 diku_C diku_1 SL1344
# 4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
# 5 SAMEA103885182 eiwy_A eiwy_1 naive
# 6 SAMEA103885136 eiwy_B eiwy_1 IFNg
# files
# 1 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
# 2 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
# 3 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
# 4 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
# 5 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
# 6 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
all(file.exists(coldata$files))
# [1] TRUE
Now we have everything we need, and can import the quantifications with tximeta. In this process, we will see that tximeta automatically identifies the source and version of the transcriptome reference that was used for the quantification, and adds some metadata. The imported data will be stored in a SummarizedExperiment container.
We will next read the Salmon quantifications provided in the macrophage package into R and summarize the expected counts on the gene level. A simple way to import results from a variety of transcript abundance estimation tools into R is provided by the tximport and tximeta packages. Here, tximport reads the quantifications into a list of matrices, while tximeta instead aggregates the information into a SummarizedExperiment object, and also automatically adds additional annotations for the features. Both packages can return quantifications on the transcript level or aggregate them on the gene level. They also calculate average transcript lengths for each gene and each sample, which can be used as offsets to improve the differential expression analysis by accounting for differential isoform usage across samples (Soneson, Love, and Robinson 2015).
The code below imports the Salmon quantifications into R using the tximeta package. Note how the transcriptome that was used for the quantification is automatically recognized and used to annotate the resulting data object. In order for this to work, tximeta requires that the output folder structure from Salmon is retained, since it reads information from the associated log files in addition to the quantified abundances themselves.
suppressPackageStartupMessages({
library(tximeta)
library(DESeq2)
library(org.Hs.eg.db)
library(SummarizedExperiment)
})
## Import quantifications on the transcript level
st <- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE)
st
# class: RangedSummarizedExperiment
# dim: 205870 24
# metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
# assays(3): counts abundance length
# rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
# ENST00000387460.2 ENST00000387461.2
# rowData names(3): tx_id gene_id tx_name
# colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
# SAMEA103884949
# colData names(4): names sample_id line_id condition_name
We see that tximeta has identified the transcriptome used for the
quantification as GENCODE - Homo sapiens - release 29. How did this happen?
In fact, the output directory from Salmon contains much more information than
just the quant.sf file! (as mentioned above, this means that it is not
advisable to move files out of the folder, or to share only the quant.sf
file, since the context is lost):
list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)
# [1] "aux_info/ambig_info.tsv.gz" "aux_info/bootstrap/bootstraps.gz"
# [3] "aux_info/bootstrap/names.tsv.gz" "aux_info/exp_gc.gz"
# [5] "aux_info/expected_bias.gz" "aux_info/fld.gz"
# [7] "aux_info/meta_info.json" "aux_info/obs_gc.gz"
# [9] "aux_info/observed_bias_3p.gz" "aux_info/observed_bias.gz"
# [11] "cmd_info.json" "lib_format_counts.json"
# [13] "libParams/flenDist.txt" "logs/salmon_quant.log.txt"
# [15] "quant.sf.gz"
In particular, the meta_info.json file contains a hash checksum, which is
derived from the set of transcripts used as reference during the quantification
and which lets tximeta identify the reference source (by comparing to a
table of these hash checksums for commonly used references).
rjson::fromJSON(file = file.path(dir, "quants", coldata$names[1],
"aux_info", "meta_info.json"))
# $salmon_version
# [1] "0.12.0"
#
# $samp_type
# [1] "gibbs"
#
# $opt_type
# [1] "vb"
#
# $quant_errors
# list()
#
# $num_libraries
# [1] 1
#
# $library_types
# [1] "ISR"
#
# $frag_dist_length
# [1] 1001
#
# $seq_bias_correct
# [1] FALSE
#
# $gc_bias_correct
# [1] TRUE
#
# $num_bias_bins
# [1] 4096
#
# $mapping_type
# [1] "mapping"
#
# $num_targets
# [1] 205870
#
# $serialized_eq_classes
# [1] FALSE
#
# $eq_class_properties
# list()
#
# $length_classes
# [1] 520 669 1065 2328 205012
#
# $index_seq_hash
# [1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"
#
# $index_name_hash
# [1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"
#
# $index_seq_hash512
# [1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"
#
# $index_name_hash512
# [1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"
#
# $num_bootstraps
# [1] 20
#
# $num_processed
# [1] 45141218
#
# $num_mapped
# [1] 40075160
#
# $percent_mapped
# [1] 88.77731
#
# $call
# [1] "quant"
#
# $start_time
# [1] "Tue Jan 15 21:21:22 2019"
#
# $end_time
# [1] "Tue Jan 15 21:29:00 2019"
Looking at the size of the SummarizedExperiment object (205,870 rows!) as well
as the row names, we see that this object contains transcript-level information.
The assays are created by directly importing the values from the quant.sf
files and combining this information across the 24 samples:
NumReads columnTPM columnEffectiveLength columnWe can access any of the assays via the assay function:
head(assay(st, "counts"), 3)
# SAMEA103885102 SAMEA103885347 SAMEA103885043 SAMEA103885392
# ENST00000456328.2 22.058 12.404 5.44 0.000
# ENST00000450305.2 0.000 0.000 0.00 0.000
# ENST00000488147.1 119.092 180.069 161.55 93.747
# SAMEA103885182 SAMEA103885136 SAMEA103885413 SAMEA103884967
# ENST00000456328.2 0.0 10.833 5.119 6.708
# ENST00000450305.2 0.0 0.000 0.000 0.000
# ENST00000488147.1 145.5 141.607 189.152 98.019
# SAMEA103885368 SAMEA103885218 SAMEA103885319 SAMEA103885004
# ENST00000456328.2 0.000 4.484 0.000 0.000
# ENST00000450305.2 0.000 0.000 0.000 0.000
# ENST00000488147.1 132.243 88.429 96.871 66.813
# SAMEA103885284 SAMEA103885059 SAMEA103884898 SAMEA103885157
# ENST00000456328.2 27.337 12.401 0.000 6.107
# ENST00000450305.2 0.000 0.000 0.000 0.000
# ENST00000488147.1 250.127 211.475 144.212 134.842
# SAMEA103885111 SAMEA103884919 SAMEA103885276 SAMEA103885021
# ENST00000456328.2 23.333 11.794 4.670 0.000
# ENST00000450305.2 0.000 0.000 0.000 0.000
# ENST00000488147.1 205.167 151.599 134.082 69.402
# SAMEA103885262 SAMEA103885228 SAMEA103885308 SAMEA103884949
# ENST00000456328.2 7.629 3.907 9.195 0.000
# ENST00000450305.2 0.000 0.000 0.000 0.000
# ENST00000488147.1 154.885 125.882 183.322 53.233
You may have noted that st is in fact a RangedSummarizedExperiment
object (rather than “just” a SummarizedExperiment object). What does this mean?
Let’s look at the information we have about the rows (transcripts) in the object:
rowRanges(st)
# GRanges object with 205870 ranges and 3 metadata columns:
# seqnames ranges strand | tx_id gene_id
# <Rle> <IRanges> <Rle> | <integer> <CharacterList>
# ENST00000456328.2 chr1 11869-14409 + | 1 ENSG00000223972.5
# ENST00000450305.2 chr1 12010-13670 + | 2 ENSG00000223972.5
# ENST00000488147.1 chr1 14404-29570 - | 9483 ENSG00000227232.5
# ENST00000619216.1 chr1 17369-17436 - | 9484 ENSG00000278267.1
# ENST00000473358.1 chr1 29554-31097 + | 3 ENSG00000243485.5
# ... ... ... ... . ... ...
# ENST00000361681.2 chrM 14149-14673 - | 206692 ENSG00000198695.2
# ENST00000387459.1 chrM 14674-14742 - | 206693 ENSG00000210194.1
# ENST00000361789.2 chrM 14747-15887 + | 206684 ENSG00000198727.2
# ENST00000387460.2 chrM 15888-15953 + | 206685 ENSG00000210195.2
# ENST00000387461.2 chrM 15956-16023 - | 206694 ENSG00000210196.2
# tx_name
# <character>
# ENST00000456328.2 ENST00000456328.2
# ENST00000450305.2 ENST00000450305.2
# ENST00000488147.1 ENST00000488147.1
# ENST00000619216.1 ENST00000619216.1
# ENST00000473358.1 ENST00000473358.1
# ... ...
# ENST00000361681.2 ENST00000361681.2
# ENST00000387459.1 ENST00000387459.1
# ENST00000361789.2 ENST00000361789.2
# ENST00000387460.2 ENST00000387460.2
# ENST00000387461.2 ENST00000387461.2
# -------
# seqinfo: 25 sequences (1 circular) from hg38 genome
By knowing the source and version of the reference used for the quantification,
tximeta was able to retrieve the annotation files and decorate the object with
information about the transcripts, such as the chromosome and position, and the
corresponding gene ID. Importantly, Salmon did not use (or know about) any of
this during the quantification! It needs only the transcript sequences.
If we just want the annotation columns, without the ranges, we can get those
with the rowData accessor:
rowData(st)
# DataFrame with 205870 rows and 3 columns
# tx_id gene_id tx_name
# <integer> <CharacterList> <character>
# ENST00000456328.2 1 ENSG00000223972.5 ENST00000456328.2
# ENST00000450305.2 2 ENSG00000223972.5 ENST00000450305.2
# ENST00000488147.1 9483 ENSG00000227232.5 ENST00000488147.1
# ENST00000619216.1 9484 ENSG00000278267.1 ENST00000619216.1
# ENST00000473358.1 3 ENSG00000243485.5 ENST00000473358.1
# ... ... ... ...
# ENST00000361681.2 206692 ENSG00000198695.2 ENST00000361681.2
# ENST00000387459.1 206693 ENSG00000210194.1 ENST00000387459.1
# ENST00000361789.2 206684 ENSG00000198727.2 ENST00000361789.2
# ENST00000387460.2 206685 ENSG00000210195.2 ENST00000387460.2
# ENST00000387461.2 206694 ENSG00000210196.2 ENST00000387461.2
Similar to the row annotations in rowData, the SummarizedExperiment
object contains sample annotations in the colData slot.
colData(st)
# DataFrame with 24 rows and 4 columns
# names sample_id line_id condition_name
# <character> <character> <character> <character>
# SAMEA103885102 SAMEA103885102 diku_A diku_1 naive
# SAMEA103885347 SAMEA103885347 diku_B diku_1 IFNg
# SAMEA103885043 SAMEA103885043 diku_C diku_1 SL1344
# SAMEA103885392 SAMEA103885392 diku_D diku_1 IFNg_SL1344
# SAMEA103885182 SAMEA103885182 eiwy_A eiwy_1 naive
# ... ... ... ... ...
# SAMEA103885021 SAMEA103885021 podx_D podx_1 IFNg_SL1344
# SAMEA103885262 SAMEA103885262 qaqx_A qaqx_1 naive
# SAMEA103885228 SAMEA103885228 qaqx_B qaqx_1 IFNg
# SAMEA103885308 SAMEA103885308 qaqx_C qaqx_1 SL1344
# SAMEA103884949 SAMEA103884949 qaqx_D qaqx_1 IFNg_SL1344
As we saw, the features in the SummarizedExperiment object above are individual
transcripts, rather than genes.
Often, however, we want to do analysis on the gene level, since the gene-level
abundances are more robust and sometimes more interpretable than transcript-level
abundances.
The rowData contains the information about the corresponding gene for each
transcript, in the gene_id column, and tximeta provides a function
to summarize on the gene level:
## Summarize quantifications on the gene level
sg <- tximeta::summarizeToGene(st)
sg
# class: RangedSummarizedExperiment
# dim: 58294 24
# metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
# assays(3): counts abundance length
# rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
# ENSG00000285993.1 ENSG00000285994.1
# rowData names(2): gene_id tx_ids
# colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
# SAMEA103884949
# colData names(4): names sample_id line_id condition_name
# compare e.g. to
st
# class: RangedSummarizedExperiment
# dim: 205870 24
# metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
# assays(3): counts abundance length
# rownames(205870): ENST00000456328.2 ENST00000450305.2 ...
# ENST00000387460.2 ENST00000387461.2
# rowData names(3): tx_id gene_id tx_name
# colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
# SAMEA103884949
# colData names(4): names sample_id line_id condition_name
Now we have a new RangedSummarizedExperiment object, with one row per gene.
The row ranges have been summarized as well, and can be used for subsetting
and interpretation just as for the transcripts.
At this point, the only information we have about the genes in our data set, apart from their genomic location and the associated transcript IDs, is the Ensembl ID. Often we need additional annotations, such as gene symbols. Bioconductor provides a range of annotation packages:
OrgDb packages, providing gene-based annotations for a given organismTxDb and EnsDb packages, providing transcript ranges for a given genome buildBSgenome packages, providing the genome sequence for a given genome buildFor our purposes here, the appropriate OrgDb package is the most suitable,
since it contains gene-centric ID conversion tables.
Since this is human data, we will use the org.Hs.eg.db package.
## Add gene symbols
sg <- tximeta::addIds(sg, "SYMBOL", gene = TRUE)
sg
# class: RangedSummarizedExperiment
# dim: 58294 24
# metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
# assays(3): counts abundance length
# rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
# ENSG00000285993.1 ENSG00000285994.1
# rowData names(3): gene_id tx_ids SYMBOL
# colnames(24): SAMEA103885102 SAMEA103885347 ... SAMEA103885308
# SAMEA103884949
# colData names(4): names sample_id line_id condition_name
head(rowData(sg))
# DataFrame with 6 rows and 3 columns
# gene_id
# <character>
# ENSG00000000003.14 ENSG00000000003.14
# ENSG00000000005.5 ENSG00000000005.5
# ENSG00000000419.12 ENSG00000000419.12
# ENSG00000000457.13 ENSG00000000457.13
# ENSG00000000460.16 ENSG00000000460.16
# ENSG00000000938.12 ENSG00000000938.12
# tx_ids
# <CharacterList>
# ENSG00000000003.14 ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
# ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1
# ENSG00000000419.12 ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
# ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
# ENSG00000000460.16 ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
# ENSG00000000938.12 ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
# SYMBOL
# <character>
# ENSG00000000003.14 TSPAN6
# ENSG00000000005.5 TNMD
# ENSG00000000419.12 DPM1
# ENSG00000000457.13 SCYL3
# ENSG00000000460.16 C1orf112
# ENSG00000000938.12 FGR
To see a list of the possible columns, use the columns function from the
AnnotationDbi package:
AnnotationDbi::columns(org.Hs.eg.db)
# [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
# [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
# [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
# [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
# [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
# [26] "UNIPROT"
We can even add annotations where we expect (and would like to retain) multiple mapping values, e.g., associated GO terms:
sg <- addIds(sg, "GO", multiVals = "list", gene = TRUE)
head(rowData(sg))
# DataFrame with 6 rows and 4 columns
# gene_id
# <character>
# ENSG00000000003.14 ENSG00000000003.14
# ENSG00000000005.5 ENSG00000000005.5
# ENSG00000000419.12 ENSG00000000419.12
# ENSG00000000457.13 ENSG00000000457.13
# ENSG00000000460.16 ENSG00000000460.16
# ENSG00000000938.12 ENSG00000000938.12
# tx_ids
# <CharacterList>
# ENSG00000000003.14 ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
# ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1
# ENSG00000000419.12 ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
# ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
# ENSG00000000460.16 ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
# ENSG00000000938.12 ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
# SYMBOL GO
# <character> <list>
# ENSG00000000003.14 TSPAN6 GO:0005515,GO:0005887,GO:0039532,...
# ENSG00000000005.5 TNMD GO:0001937,GO:0005515,GO:0005635,...
# ENSG00000000419.12 DPM1 GO:0004169,GO:0004582,GO:0004582,...
# ENSG00000000457.13 SCYL3 GO:0000139,GO:0005515,GO:0005524,...
# ENSG00000000460.16 C1orf112 GO:0005515
# ENSG00000000938.12 FGR GO:0001784,GO:0001784,GO:0001819,...
Note that Salmon returns estimated or expected counts, which are not necessarily integers. They may need to be rounded before they are passed to count-based statistical methods. To obtain consistent results with different pipelines, we round the estimated counts here (note that in practice, DESeq2 will automatically round the counts, while edgeR will work well also with the non-integer values).
assay(sg, "counts") <- round(assay(sg, "counts"))
At this point, we have a gene-level count matrix, contained in our SummarizedExperiment object. This is a branching point where we could use a variety of Bioconductor packages for exploration and differential expression of the count matrix, including edgeR (Robinson, McCarthy, and Smyth 2009), DESeq2 (Love, Huber, and Anders 2014), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and BaySeq (Hardcastle and Kelly 2010). We will continue using DESeq2 and edgeR.
Bioconductor software packages often define and use a custom class for storing data that makes sure that all the needed data slots are consistently provided and fulfill any requirements. In addition, Bioconductor has general data classes (such as the SummarizedExperiment) that can be used to move data between packages. The DEFormats package can be useful for converting between different classes. The core Bioconductor classes also provide useful functionality: for example, subsetting or reordering the rows or columns of a SummarizedExperiment automatically subsets or reorders the associated rowRanges and colData, which can help to prevent accidental sample swaps that would otherwise lead to spurious results. With SummarizedExperiment this is all taken care of behind the scenes.
Each of the packages we will use for differential expression has a specific class of object used to store the summarization of the RNA-seq experiment and the intermediate quantities that are calculated during the statistical analysis of the data. DESeq2 uses a DESeqDataSet and edgeR uses a DGEList.
In DESeq2, the custom class is called DESeqDataSet. It is built on top of
the SummarizedExperiment class, and it is easy to convert
SummarizedExperiment objects into DESeqDataSet objects. One of the two main
differences compared to a SummarizedExperiment object is that the assay slot
can be accessed using the counts accessor function, and the DESeqDataSet
class enforces that the values in this matrix are non-negative integers.
A second difference is that the DESeqDataSet has an associated design
formula. The experimental design is specified at the beginning of the analysis,
as it will inform many of the DESeq2 functions how to treat the samples in the
analysis (one exception is the size factor estimation, i.e., the adjustment for
differing library sizes, which does not depend on the design formula). The
design formula tells which columns in the sample information table (colData)
specify the experimental design and how these factors should be used in the
analysis.
Let’s remind ourselves of the design of our experiment:
colData(sg)
# DataFrame with 24 rows and 4 columns
# names sample_id line_id condition_name
# <character> <character> <character> <character>
# SAMEA103885102 SAMEA103885102 diku_A diku_1 naive
# SAMEA103885347 SAMEA103885347 diku_B diku_1 IFNg
# SAMEA103885043 SAMEA103885043 diku_C diku_1 SL1344
# SAMEA103885392 SAMEA103885392 diku_D diku_1 IFNg_SL1344
# SAMEA103885182 SAMEA103885182 eiwy_A eiwy_1 naive
# ... ... ... ... ...
# SAMEA103885021 SAMEA103885021 podx_D podx_1 IFNg_SL1344
# SAMEA103885262 SAMEA103885262 qaqx_A qaqx_1 naive
# SAMEA103885228 SAMEA103885228 qaqx_B qaqx_1 IFNg
# SAMEA103885308 SAMEA103885308 qaqx_C qaqx_1 SL1344
# SAMEA103884949 SAMEA103884949 qaqx_D qaqx_1 IFNg_SL1344
We have samples from four different conditions, and six donors:
table(colData(sg)$condition_name)
#
# IFNg IFNg_SL1344 naive SL1344
# 6 6 6 6
table(colData(sg)$line_id)
#
# diku_1 eiwy_1 fikt_3 ieki_2 podx_1 qaqx_1
# 4 4 4 4 4 4
# possible to use a shortcut
table(sg$line_id)
#
# diku_1 eiwy_1 fikt_3 ieki_2 podx_1 qaqx_1
# 4 4 4 4 4 4
We want to find the changes in gene expression that can be associated with
the different treatments, but we also want to control for differences between the
donors. The design which accomplishes this is obtained by writing
~ line_id + condition_name. By including line_id, terms will be
added to the model which
account for differences across donors, and by adding condition_name we get
terms representing the different treatment effects.
Note: it will be helpful for us if the first level of a factor is the reference level (e.g. control, or untreated samples). The reason is that by specifying this, functions further in the pipeline can be used and will give comparisons such as ‘treatment vs control’, without needing to specify additional arguments.
We can relevel the condition_name factor like so:
colData(sg)$condition_name <- factor(colData(sg)$condition_name)
colData(sg)$condition_name <- relevel(colData(sg)$condition_name, ref = "naive")
colData(sg)$condition_name
# [1] naive IFNg SL1344 IFNg_SL1344 naive IFNg
# [7] SL1344 IFNg_SL1344 naive IFNg SL1344 IFNg_SL1344
# [13] naive IFNg SL1344 IFNg_SL1344 naive IFNg
# [19] SL1344 IFNg_SL1344 naive IFNg SL1344 IFNg_SL1344
# Levels: naive IFNg IFNg_SL1344 SL1344
You can use R’s formula notation to express any fixed-effects experimental
design for edgeR or DESeq2. Note that these packages use the same formula
notation as, for instance, the lm function of base R.
Using the ExploreModelMatrix R/Bioconductor package, we can represent our design in a graphical way:
library(ExploreModelMatrix)
vd <- VisualizeDesign(sampleData = colData(sg),
designFormula = ~ line_id + condition_name,
textSizeFitted = 4)
vd$plotlist
# [[1]]
vd$cooccurrenceplots
# [[1]]
We can also open the interactive interface to explore our design further:
ExploreModelMatrix(sampleData = colData(sg),
designFormula = ~ line_id + condition_name)
To generate a DESeqDataSet object from a SummarizedExperiment object, we only need to additionally provide the experimental design in terms of a formula.
dds <- DESeqDataSet(sg, design = ~ line_id + condition_name)
We can also create a DESeqDataSet directly from a count matrix, a data frame
with sample information and a design formula (see the DESeqDataSetFromMatrix
function).
As mentioned above, the edgeR package uses another type of data container, namely a DGEList object. tximeta provides a convenient wrapper function to generate a DGEList from the gene-level SummarizedExperiment object:
library(edgeR)
dge <- tximeta::makeDGEList(sg)
names(dge)
# [1] "counts" "samples" "genes" "offset"
head(dge$samples)
# group lib.size norm.factors names sample_id line_id
# SAMEA103885102 1 40074879 1 SAMEA103885102 diku_A diku_1
# SAMEA103885347 1 40467661 1 SAMEA103885347 diku_B diku_1
# SAMEA103885043 1 41832780 1 SAMEA103885043 diku_C diku_1
# SAMEA103885392 1 42535180 1 SAMEA103885392 diku_D diku_1
# SAMEA103885182 1 40738502 1 SAMEA103885182 eiwy_A eiwy_1
# SAMEA103885136 1 39701890 1 SAMEA103885136 eiwy_B eiwy_1
# condition_name
# SAMEA103885102 naive
# SAMEA103885347 IFNg
# SAMEA103885043 SL1344
# SAMEA103885392 IFNg_SL1344
# SAMEA103885182 naive
# SAMEA103885136 IFNg
As for the DESeqDataSet, a DGEList can also be generated
directly from a count matrix and sample metadata (see the DGEList()
constructor function).
Just like the SummarizedExperiment and the DESeqDataSet, the DGEList
contains all the information we need: the count matrix, information about the
samples (the columns of the count matrix), and information about the genes (the
rows of the count matrix). One difference compared to the DESeqDataSet is that
the experimental design is not defined when creating the DGEList, but later in
the workflow.
It is often helpful to filter out lowly expressed genes before continuing with the analysis, to remove features that have nearly no information, increase the speed of the analysis and reduce the size of the data. At the very least we exclude genes with zero counts across all samples.
nrow(dds)
# [1] 58294
table(rowSums(assay(dds, "counts")) == 0)
#
# FALSE TRUE
# 38829 19465
Here, we additionally remove genes that have a single read across the samples.
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep, ]
dge <- dge[match(rownames(dds), rownames(dge)), ]
dim(dds)
# [1] 35683 24
dim(dge)
# [1] 35683 24
Importantly, the group information should not be used to define the filtering criterion, since that can interfere with the validity of the p-values downstream.
There are two separate analysis paths in this tutorial:
Importantly, the statistical testing methods rely on original count data (not scaled or transformed) for calculating the precision of measurements. However, for visualization and exploratory analysis, transformed counts are typically more suitable. Thus, it is critical to separate the two workflows and use the appropriate input data for each of them.
Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic. For RNA-seq raw counts, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of size-factor-normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with the very lowest counts will tend to dominate the results because, due to the strong Poisson noise inherent to small count values, and the fact that the logarithm amplifies differences for the smallest values, these low count genes will show the strongest relative differences between samples.
As a solution, DESeq2 offers transformations for count data that stabilize the
variance across the mean: the regularized logarithm (rlog) and the variance
stabilizing transformation (VST). These have slightly different
implementations, discussed a bit in the DESeq2 paper and in the vignette, but
a similar goal of stabilizing the variance across the range of values. Both
produce log2-like values for high counts. Here we will use the variance
stabilizing transformation implemented with the vst function.
vsd <- DESeq2::vst(dds)
This returns a DESeqTransform object…
class(vsd)
# [1] "DESeqTransform"
# attr(,"package")
# [1] "DESeq2"
…which retains all the column metadata that was attached to the DESeqDataSet:
head(colData(vsd), 3)
# DataFrame with 3 rows and 4 columns
# names sample_id line_id condition_name
# <character> <character> <factor> <factor>
# SAMEA103885102 SAMEA103885102 diku_A diku_1 naive
# SAMEA103885347 SAMEA103885347 diku_B diku_1 IFNg
# SAMEA103885043 SAMEA103885043 diku_C diku_1 SL1344
One way to visualize sample-to-sample distances is a principal components analysis (PCA). In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences (Figure below). The x-axis (the first principal component, or PC1) is the direction that separates the data points the most (i.e., the direction with the largest variance). The y-axis (the second principal component, or PC2) represents the direction with largest variance subject to the constraint that it must be orthogonal to the first direction. The percent of the total variance that is contained in the direction is printed in the axis label. Note that these percentages do not sum to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).
DESeq2::plotPCA(vsd, intgroup = "condition_name")
Additionally, the pcaExplorer package has some functionality on top to explore datasets from the point of view of Principal Components - including also a functional interpretation of it with the pca2go() function.
library(pcaExplorer)
pcaplot(vsd, intgroup = "condition_name", ellipse = TRUE)
Another way to reduce dimensionality, which is in many ways similar to PCA, is multidimensional scaling (MDS). For MDS, we first have to calculate all pairwise distances between our objects (samples in this case), and then create a (typically) two-dimensional representation where these pre-calculated distances are represented as accurately as possible. This means that depending on how the pairwise sample distances are defined, the two-dimensional plot can be very different, and it is important to choose a distance that is suitable for the type of data at hand.
edgeR contains a function plotMDS, which operates on a DGEList object and
generates a two-dimensional MDS representation of the samples. The default
distance between two samples can be interpreted as the “typical” log fold change
between the two samples, for the genes that are most different between them (by
default, the top 500 genes, but this can be modified). We generate an MDS plot
from the DGEList object dge, coloring by the treatment and using different
plot symbols for different donors.
Note: Since the DGEList was created using the makeDGEList function,
the average transcript length offsets have been incorporated in the object and
will be used as offsets in downstream analysis. If this is not the case,
we need to estimate TMM normalization factors before performing further analysis.
# dge <- edgeR::calcNormFactors(dge)
plotMDS(dge, top = 500, labels = NULL,
col = as.numeric(dge$samples$condition_name),
cex = 0.5, gene.selection = "common")
As we have already specified an experimental design when we created the
DESeqDataSet, we can run the differential expression pipeline on the raw
counts with a single call to the function DESeq. We can also plot the
estimated dispersions.
dds <- DESeq2::DESeq(dds)
DESeq2::plotDispEsts(dds)
The DESeq function will print out a message for the various steps it performs. These
are described in more detail in the manual page, which can be
accessed by typing ?DESeq. Briefly these are: the estimation of size factors
(controlling for differences in the sequencing depth of the samples), the
estimation of dispersion values for each gene, and fitting a generalized linear
model.
A DESeqDataSet is returned that contains all the fitted parameters within it, and the following section describes how to extract out results tables of interest from this object.
Calling the results function without any arguments will extract the estimated log2
fold changes and p values for the last variable in the design formula. If
there are more than 2 levels for this variable, results will extract the
results table for a comparison of the last level over the first level. This
comparison is printed at the top of the output: condition name SL1344 vs naive.
Other comparisons can be performed via the contrast argument. For example,
we will focus on comparing the IFN gamma treatment to the naive group.
## Default - SL1344 vs naive
res <- DESeq2::results(dds)
head(res)
# log2 fold change (MLE): condition name SL1344 vs naive
# Wald test p-value: condition name SL1344 vs naive
# DataFrame with 6 rows and 6 columns
# baseMean log2FoldChange lfcSE stat pvalue
# <numeric> <numeric> <numeric> <numeric> <numeric>
# ENSG00000000003.14 171.782 0.1171248 0.3008327 0.389335 6.97028e-01
# ENSG00000000419.12 967.527 0.0886824 0.0860008 1.031181 3.02456e-01
# ENSG00000000457.13 681.637 0.7109442 0.1973877 3.601766 3.16062e-04
# ENSG00000000460.16 263.282 -1.0347169 0.2179499 -4.747499 2.05947e-06
# ENSG00000000938.12 2646.887 1.6453083 0.2348000 7.007275 2.43005e-12
# ENSG00000000971.15 3045.742 0.7794411 0.4980265 1.565059 1.17569e-01
# padj
# <numeric>
# ENSG00000000003.14 8.11132e-01
# ENSG00000000419.12 4.55113e-01
# ENSG00000000457.13 1.22290e-03
# ENSG00000000460.16 1.18399e-05
# ENSG00000000938.12 3.14455e-11
# ENSG00000000971.15 2.20491e-01
## We'll instead focus on IFNgamma vs naive
res <- DESeq2::results(dds, contrast = c("condition_name", "IFNg", "naive"))
head(res)
# log2 fold change (MLE): condition_name IFNg vs naive
# Wald test p-value: condition name IFNg vs naive
# DataFrame with 6 rows and 6 columns
# baseMean log2FoldChange lfcSE stat pvalue
# <numeric> <numeric> <numeric> <numeric> <numeric>
# ENSG00000000003.14 171.782 -0.2829860 0.3010930 -0.939862 3.47288e-01
# ENSG00000000419.12 967.527 0.0383933 0.0856623 0.448194 6.54013e-01
# ENSG00000000457.13 681.637 1.2838945 0.1966270 6.529593 6.59486e-11
# ENSG00000000460.16 263.282 -1.4725128 0.2183088 -6.745092 1.52930e-11
# ENSG00000000938.12 2646.887 0.6747921 0.2351631 2.869464 4.11168e-03
# ENSG00000000971.15 3045.742 4.9869519 0.4966828 10.040518 1.01142e-23
# padj
# <numeric>
# ENSG00000000003.14 5.77116e-01
# ENSG00000000419.12 8.25573e-01
# ENSG00000000457.13 1.62287e-09
# ENSG00000000460.16 4.10702e-10
# ENSG00000000938.12 1.89738e-02
# ENSG00000000971.15 1.13504e-21
As res is a DataFrame object, it carries metadata with information on the
meaning of the columns:
mcols(res, use.names = TRUE)
# DataFrame with 6 rows and 2 columns
# type description
# <character> <character>
# baseMean intermediate mean of normalized c..
# log2FoldChange results log2 fold change (ML..
# lfcSE results standard error: cond..
# stat results Wald statistic: cond..
# pvalue results Wald test p-value: c..
# padj results BH adjusted p-values
The first column, baseMean, is a just the average of the normalized count
values, dividing by size factors, taken over all samples in the DESeqDataSet.
The remaining four columns refer to a specific contrast, namely the comparison
of the IFNg level over the naive level for the factor variable
condition_name.
The column log2FoldChange is the effect size estimate. It tells us how much
the gene’s expression seems to have changed due to infection with IFN gamma
in comparison to naive samples. This value is reported on a logarithmic
scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s
expression is increased by a multiplicative factor of 2^1.5.
Of course, this estimate has an uncertainty associated with it, which is
available in the column lfcSE, the standard error estimate for the log2 fold
change estimate. We can also express the uncertainty of a particular effect
size estimate as the result of a statistical test. The purpose of a test for
differential expression is to test whether the data provides sufficient evidence
to conclude that this value is really different from zero. DESeq2 performs for
each gene a hypothesis test to see whether evidence is sufficient to decide
against the null hypothesis that there is zero effect of the treatment on the
gene and that the observed difference between treatment and control was merely
caused by experimental variability (i.e., the type of variability that you can
expect between different samples in the same treatment group). As usual in
statistics, the result of this test is reported as a p value, and it is found
in the column pvalue. Remember that a p value indicates the probability that
an effect as strong as the observed one, or even stronger, would be seen under
the situation described by the null hypothesis.
We can also summarize the results with the following line of code, which reports some additional information, that will be covered in later sections.
summary(res)
#
# out of 35683 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up) : 3718, 10%
# LFC < 0 (down) : 3437, 9.6%
# outliers [1] : 0, 0%
# low counts [2] : 12453, 35%
# (mean count < 2)
# [1] see 'cooksCutoff' argument of ?results
# [2] see 'independentFiltering' argument of ?results
hist(res$pvalue)
## Remove the genes that were filtered out in the independent filtering
hist(res$pvalue[!is.na(res$padj)])
## We also add a couple of extra columns that will be useful for the interactive
## visualization later
rowData(dds)$log10Dispersion <- log10(rowData(dds)$dispersion)
restmp <- DataFrame(res)
restmp$log10BaseMean <- log10(restmp$baseMean)
restmp$mlog10PValue <- -log10(restmp$pvalue)
colnames(restmp) <- paste0("DESeq2_IFNg_vs_naive_", colnames(restmp))
rowData(dds) <- cbind(rowData(dds), restmp)
Note that there are many genes with differential expression due to IFN gamma treatment at the FDR level of 10%. There are two ways to be more strict about which set of genes are considered significant:
padj in the
results table)lfcThreshold argument
of resultsIf we lower the false discovery rate threshold, we should also tell this value
to results(), so that the function will use an alternative threshold for the
optimal independent filtering step:
res.05 <- results(dds, alpha = 0.05,
contrast = c("condition_name", "IFNg", "naive"))
table(res.05$padj < 0.05)
#
# FALSE TRUE
# 16423 6115
If we want to raise the log2 fold change threshold, so that we test for genes
that show more substantial changes due to treatment, we simply supply a value on
the log2 scale. For example, by specifying lfcThreshold = 1, we look for genes
that show significant effects of treatment on gene counts more than doubling or
less than halving, because 2^1 = 2.
resLFC1 <- results(dds, lfcThreshold = 1,
contrast = c("condition_name", "IFNg", "naive"))
summary(resLFC1)
#
# out of 35683 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 1.00 (up) : 687, 1.9%
# LFC < -1.00 (down) : 419, 1.2%
# outliers [1] : 0, 0%
# low counts [2] : 0, 0%
# (mean count < 0)
# [1] see 'cooksCutoff' argument of ?results
# [2] see 'independentFiltering' argument of ?results
table(resLFC1$padj < 0.1)
#
# FALSE TRUE
# 34577 1106
Sometimes a subset of the p values in res will be NA (“not available”).
This is DESeq’s way of reporting that all counts for this gene were zero, and
hence no test was applied. In addition, p values can be assigned NA if the
gene was excluded from analysis because it contained an extreme count outlier.
For more information, see the outlier detection section of the DESeq2
vignette.
With DESeq2, there is also an easy way to plot the (normalized, transformed)
counts for specific genes, using the plotCounts function:
plotCounts(dds, gene = "ENSG00000126561.16", intgroup = "condition_name",
normalized = TRUE, transform = FALSE)
Next we will show how to perform differential expression analysis with edgeR.
Recall that we have a DGEList dge, containing all the necessary information:
names(dge)
# [1] "counts" "samples" "genes" "offset"
We first define a design matrix, using the same formula syntax as for DESeq2 above.
design <- model.matrix(~ line_id + condition_name, data = dge$samples)
head(design)
# (Intercept) line_ideiwy_1 line_idfikt_3 line_idieki_2
# SAMEA103885102 1 0 0 0
# SAMEA103885347 1 0 0 0
# SAMEA103885043 1 0 0 0
# SAMEA103885392 1 0 0 0
# SAMEA103885182 1 1 0 0
# SAMEA103885136 1 1 0 0
# line_idpodx_1 line_idqaqx_1 condition_nameIFNg
# SAMEA103885102 0 0 0
# SAMEA103885347 0 0 1
# SAMEA103885043 0 0 0
# SAMEA103885392 0 0 0
# SAMEA103885182 0 0 0
# SAMEA103885136 0 0 1
# condition_nameIFNg_SL1344 condition_nameSL1344
# SAMEA103885102 0 0
# SAMEA103885347 0 0
# SAMEA103885043 0 1
# SAMEA103885392 1 0
# SAMEA103885182 0 0
# SAMEA103885136 0 0
While DESeq2 performs independent filtering of lowly expressed genes
internally, this is done by the user before applying edgeR. Here, we filter
out lowly expressed genes using the filterByExpr() function, and then estimate
the dispersion for each gene. Note that it is important that we specify the
design in the dispersion calculation (it will be used to determine a
suitable number of samples to require a gene to be expressed in).
Afterwards, we plot the estimated dispersions.
keep <- edgeR::filterByExpr(dge, design)
dge <- dge[keep, ]
dge <- edgeR::estimateDisp(dge, design)
edgeR::plotBCV(dge)
Finally, we fit the generalized linear model and perform the test. In the
glmQLFTest function, we indicate which coefficient (which column in the design
matrix) that we would like to test for. It is possible to test more general
contrasts as well, and the user guide contains many examples on how to do this.
The topTags function extracts the top-ranked genes. You can indicate the
adjusted p-value cutoff, and/or the number of genes to keep.
fit <- edgeR::glmQLFit(dge, design)
qlf <- edgeR::glmQLFTest(fit, coef = "condition_nameIFNg")
tt.all <- edgeR::topTags(qlf, n = nrow(dge), sort.by = "none") # all genes
hist(tt.all$table$PValue)
tt <- edgeR::topTags(qlf, n = nrow(dge), p.value = 0.1) # genes with adj.p<0.1
tt10 <- edgeR::topTags(qlf) # just the top 10 by default
tt10
# Coefficient: condition_nameIFNg
# gene_id
# ENSG00000111181.12 ENSG00000111181.12
# ENSG00000125347.13 ENSG00000125347.13
# ENSG00000137496.17 ENSG00000137496.17
# ENSG00000204257.14 ENSG00000204257.14
# ENSG00000162645.12 ENSG00000162645.12
# ENSG00000145365.10 ENSG00000145365.10
# ENSG00000174944.8 ENSG00000174944.8
# ENSG00000204267.13 ENSG00000204267.13
# ENSG00000134470.20 ENSG00000134470.20
# ENSG00000100911.15 ENSG00000100911.15
# tx_ids
# ENSG00000111181.12 ENST00000397296.6, ENST00000359674.8, ENST00000545058.5, ENST00000424061.6, ENST00000536824.5, ENST00000542825.5, ENST00000535498.5, ENST00000544782.1, ENST00000538272.5, ENST00000540094.1, ENST00000538580.1, ENST00000536116.5, ENST00000537793.1, ENST00000535347.5, ENST00000537826.1, ENST00000538424.1
# ENSG00000125347.13 ENST00000245414.8, ENST00000472045.1, ENST00000405885.6, ENST00000613424.4, ENST00000437654.5, ENST00000459982.5, ENST00000458069.5, ENST00000463784.5, ENST00000439555.2, ENST00000476613.1, ENST00000493208.1
# ENSG00000137496.17 ENST00000393703.8, ENST00000497194.6, ENST00000393705.8, ENST00000525932.5, ENST00000414358.2, ENST00000531777.1, ENST00000337131.9, ENST00000620017.4, ENST00000531053.5, ENST00000343898.9, ENST00000404792.5, ENST00000534583.5, ENST00000260049.9, ENST00000393707.4
# ENSG00000204257.14 ENST00000480785.5, ENST00000395305.7, ENST00000395303.7, ENST00000477541.1, ENST00000374843.8, ENST00000464392.1, ENST00000456800.1, ENST00000422832.1, ENST00000475627.1
# ENSG00000162645.12 ENST00000464839.5, ENST00000370466.3, ENST00000493802.5, ENST00000463660.1
# ENSG00000145365.10 ENST00000361717.3, ENST00000500655.2
# ENSG00000174944.8 ENST00000309170.7, ENST00000424796.6, ENST00000494668.1
# ENSG00000204267.13 ENST00000374899.8, ENST00000620123.4, ENST00000374897.2, ENST00000464100.1, ENST00000485701.1
# ENSG00000134470.20 ENST00000620345.4, ENST00000435171.6, ENST00000397251.7, ENST00000379977.7, ENST00000397248.6, ENST00000525219.6, ENST00000534292.5, ENST00000532948.5, ENST00000379972.6, ENST00000528354.5, ENST00000447291.5, ENST00000379974.1, ENST00000532039.5, ENST00000397250.6, ENST00000379971.5, ENST00000397246.7, ENST00000397255.7, ENST00000530685.5, ENST00000622442.4, ENST00000620865.4, ENST00000429135.2, ENST00000453922.1
# ENSG00000100911.15 ENST00000559005.2, ENST00000560410.5, ENST00000615264.4, ENST00000216802.9, ENST00000471700.6, ENST00000559042.1, ENST00000559453.5, ENST00000558273.5, ENST00000558931.5, ENST00000560370.3, ENST00000559359.1, ENST00000559056.5, ENST00000559493.5, ENST00000560592.5, ENST00000560788.1, ENST00000559613.1, ENST00000630027.1, ENST00000561103.1
# SYMBOL
# ENSG00000111181.12 SLC6A12
# ENSG00000125347.13 IRF1
# ENSG00000137496.17 IL18BP
# ENSG00000204257.14 HLA-DMA
# ENSG00000162645.12 GBP2
# ENSG00000145365.10 TIFA
# ENSG00000174944.8 P2RY14
# ENSG00000204267.13 TAP2
# ENSG00000134470.20 IL15RA
# ENSG00000100911.15 PSME2
# GO
# ENSG00000111181.12 GO:0003333, GO:0005332, GO:0005515, GO:0005886, GO:0005887, GO:0006865, GO:0008028, GO:0015171, GO:0015718, GO:0016021, GO:0035725, GO:0043005, GO:0051936, GO:0098793
# ENSG00000125347.13 GO:0000785, GO:0000785, GO:0000976, GO:0000977, GO:0000978, GO:0000978, GO:0000978, GO:0000981, GO:0000981, GO:0000981, GO:0001228, GO:0001228, GO:0002376, GO:0002819, GO:0003677, GO:0005515, GO:0005634, GO:0005634, GO:0005634, GO:0005654, GO:0005737, GO:0005829, GO:0006357, GO:0006915, GO:0008285, GO:0032481, GO:0032728, GO:0032735, GO:0034124, GO:0035458, GO:0043374, GO:0045088, GO:0045590, GO:0045892, GO:0045893, GO:0045893, GO:0045944, GO:0045944, GO:0051607, GO:0051726, GO:0051726, GO:0060333, GO:0071260, GO:2000564
# ENSG00000137496.17 GO:0005576, GO:0005615, GO:0032496, GO:0042007, GO:0042007, GO:0042088, GO:0042088, GO:0048019, GO:0070062, GO:0070301, GO:0071356, GO:2000272
# ENSG00000204257.14 GO:0002250, GO:0002381, GO:0002503, GO:0002503, GO:0005515, GO:0005765, GO:0005765, GO:0005765, GO:0009986, GO:0016020, GO:0016021, GO:0019886, GO:0023026, GO:0023026, GO:0031902, GO:0031902, GO:0042613, GO:0042613, GO:0043231, GO:0050870
# ENSG00000162645.12 GO:0000139, GO:0003924, GO:0005515, GO:0005525, GO:0005634, GO:0005654, GO:0005737, GO:0005794, GO:0005829, GO:0005829, GO:0006955, GO:0031410, GO:0031410, GO:0034504, GO:0042802, GO:0042803, GO:0042832, GO:0048471, GO:0050830, GO:0071346, GO:0071346, GO:0071347, GO:0071356
# ENSG00000145365.10 GO:0002753, GO:0002753, GO:0005515, GO:0005737, GO:0005737, GO:0005829, GO:0007249, GO:0043123, GO:0043123, GO:0045087, GO:0045087, GO:0051260
# ENSG00000174944.8 GO:0005886, GO:0007186, GO:0007186, GO:0016021, GO:0035589, GO:0045028, GO:0045029
# ENSG00000204267.13 GO:0002250, GO:0002489, GO:0005515, GO:0005524, GO:0005524, GO:0005783, GO:0005789, GO:0015031, GO:0015433, GO:0015433, GO:0015433, GO:0015440, GO:0015833, GO:0016020, GO:0016021, GO:0016021, GO:0016607, GO:0019885, GO:0019885, GO:0019885, GO:0023029, GO:0030176, GO:0030670, GO:0033116, GO:0042288, GO:0042605, GO:0042626, GO:0042824, GO:0042824, GO:0042825, GO:0046872, GO:0046967, GO:0046968, GO:0046968, GO:0046978, GO:0046978, GO:0046980, GO:0055085, GO:1904680
# ENSG00000134470.20 GO:0000139, GO:0004896, GO:0005515, GO:0005615, GO:0005768, GO:0005789, GO:0005886, GO:0005886, GO:0009986, GO:0016021, GO:0019901, GO:0030659, GO:0031965, GO:0032825, GO:0035723, GO:0035723, GO:0042010, GO:0042010, GO:0050766
# ENSG00000100911.15 GO:0000502, GO:0005515, GO:0005654, GO:0005654, GO:0005737, GO:0005829, GO:0008537, GO:0010950, GO:0016020, GO:0042802, GO:0061133, GO:0061136, GO:0070062, GO:2000045
# logFC logCPM F PValue FDR
# ENSG00000111181.12 4.705110 4.271224 505.8824 4.181499e-15 8.040187e-11
# ENSG00000125347.13 5.552412 9.415299 462.2690 9.509923e-15 9.142840e-11
# ENSG00000137496.17 4.045715 7.356263 404.7655 3.174984e-14 2.034953e-10
# ENSG00000204257.14 4.062854 5.544983 378.5629 5.813342e-14 2.794473e-10
# ENSG00000162645.12 6.663163 9.603736 354.0817 1.061790e-13 3.575685e-10
# ENSG00000145365.10 5.188439 6.703715 352.1349 1.115774e-13 3.575685e-10
# ENSG00000174944.8 9.807319 5.276214 338.5682 1.588138e-13 4.104064e-10
# ENSG00000204267.13 3.452324 8.021179 332.2798 1.879065e-13 4.104064e-10
# ENSG00000134470.20 4.293508 6.548007 331.4633 1.920979e-13 4.104064e-10
# ENSG00000100911.15 3.354760 8.025720 310.9602 3.402208e-13 6.541765e-10
The columns in the edgeR result data frame are similar to the ones output by
DESeq2. edgeR represents the overall expression level on the log-CPM scale
rather than on the normalized count scale that DESeq2 uses. The F column
contains the test statistic, and the FDR column contains the
Benjamini-Hochberg adjusted p-values.
We can compare the sets of significantly differentially expressed genes to see how the results from the two packages overlap:
shared <- intersect(rownames(res), rownames(tt.all$table))
table(DESeq2 = res$padj[match(shared, rownames(res))] < 0.1,
edgeR = tt.all$table$FDR[match(shared, rownames(tt.all$table))] < 0.1)
# edgeR
# DESeq2 FALSE TRUE
# FALSE 12175 98
# TRUE 434 6520
We can also compare the two result lists by the ranks:
plot(rank(res$pvalue[match(shared, rownames(res))]),
rank(tt.all$table$PValue[match(shared, rownames(tt.all$table))]),
cex = 0.1, xlab = "DESeq2", ylab = "edgeR")
Also with edgeR we can test for significance relative to a fold-change
threshold, using the function glmTreat. Below we set the log fold-change
threshold to 1 (i.e., fold change threshold equal to 2), as for DESeq2 above.
treatres <- edgeR::glmTreat(fit, coef = "condition_nameIFNg", lfc = 1)
tt.treat <- edgeR::topTags(treatres, n = nrow(dge), sort.by = "none")
An MA-plot (Dudoit et al. 2002) provides a useful overview for an experiment with a two-group comparison. The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis (“M” for minus, because a log ratio is equal to log minus log, and “A” for average). Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default with DESeq2) are shown in color
DESeq2::plotMA(res, ylim = c(-5, 5))
We see that there are many genes with low expression levels that nevertheless
have large fold changes (since we are, effectively, dividing by a small
number). To get more interpretable log fold changes (e.g., for ranking genes),
we use the lfcShrink function to shrink the log2
fold changes for the comparison of IFN gamma-treated vs naive samples. There are
three types of shrinkage estimators in DESeq2, which are covered in the
vignette. Here we specify the apeglm method for shrinking coefficients, which
is good for shrinking the noisy LFC estimates while giving low bias LFC
estimates for true large differences (Zhu, Ibrahim, and Love 2019). To use apeglm we specify
a coefficient from the model to shrink, either by name or number as the
coefficient appears in resultsNames(dds).
library(apeglm)
DESeq2::resultsNames(dds)
# [1] "Intercept" "line_id_eiwy_1_vs_diku_1"
# [3] "line_id_fikt_3_vs_diku_1" "line_id_ieki_2_vs_diku_1"
# [5] "line_id_podx_1_vs_diku_1" "line_id_qaqx_1_vs_diku_1"
# [7] "condition_name_IFNg_vs_naive" "condition_name_IFNg_SL1344_vs_naive"
# [9] "condition_name_SL1344_vs_naive" "DESeq2_IFNg_vs_naive_log2FoldChange"
resape <- DESeq2::lfcShrink(dds, coef = "condition_name_IFNg_vs_naive", type = "apeglm")
DESeq2::plotMA(resape, ylim = c(-5, 5))
In edgeR, the MA plot is obtained via the plotSmear function.
edgeR::plotSmear(qlf, de.tags = rownames(tt$table))
Another way of representing the results of a differential expression analysis is to construct a heatmap of the top differentially expressed genes. A heatmap is a “color coded expression matrix”, where the rows and columns are clustered using hierarchical clustering. Typically, it should not be applied to counts, but works better with transformed values. Here we show how it can be applied to the variance-stabilized values generated above. We would expect the contrasted sample groups to cluster separately (“by construction”, since the genes were selected to be most discriminative between the groups). The heatmap will allow us to display, e.g., the variability within the groups of the differentially expressed genes. We choose the top 30 differentially expressed genes. There are many functions in R that can generate heatmaps, here we show the one from the pheatmap package.
library(pheatmap)
stopifnot(rownames(vsd) == rownames(res))
mat <- assay(vsd)
rownames(mat) <- rowData(vsd)$SYMBOL
mat <- mat[head(order(res$padj), 30), ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(vsd)[, c("condition_name"), drop = FALSE])
pheatmap(mat, annotation_col = df)
We can of course also create heatmaps for other sets of genes - for example, the collection of genes with the highest overall variance (which may or may not indicate a difference between the groups - in this particular case most of the highly variable genes show a clear difference between the groups).
mat <- assay(vsd)
rownames(mat) <- rowData(vsd)$SYMBOL
topVarGenes <- head(order(rowVars(mat), decreasing = TRUE), 30)
mat <- mat[topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(vsd)[, c("condition_name"), drop = FALSE])
pheatmap(mat, annotation_col = df)
TODO: see the “stripes” in some genes over all the samples
iSEE is a Bioconductor package that allows interactive exploration of any data
stored in a SummarizedExperiment container, or any class extending this (such
as, e.g., the DESeqDataSet class, or the SingleCellExperiment for
single-cell data). By calling the iSEE() function with the object as the first
argument, an interactive application will be opened, in which all observed
values as well as metadata columns (rowData and colData) can be explored.
library(iSEE)
library(iSEEu)
dds <- iSEEu::registerAveAbPatterns(dds, "log10BaseMean")
dds <- iSEEu::registerLogFCPatterns(dds, "log2FoldChange")
dds <- iSEEu::registerPValuePatterns(dds, "pvalue")
app <- iSEE(dds, initial = list(MAPlot(), VolcanoPlot(),
RowDataTable(), FeatureAssayPlot()))
## shiny::runApp(app)
You can easily save the results table in a CSV file that you can then share or
load with a spreadsheet program such as Excel (note, however, that Excel
sometimes does funny things to gene identifiers (Zeeberg et al. 2004; Ziemann, Eren, and El-Osta 2016)). The call to as.data.frame is necessary to convert the
DataFrame object to a data.frame object
that can be processed by write.csv. Here, we first show how to add gene
symbols to the output table, and then export just the top 100 genes for
demonstration.
stopifnot(all(rownames(res) == rownames(dds)))
res$symbol <- rowData(dds)$SYMBOL
resOrdered <- res[order(res$padj), ]
head(resOrdered)
# log2 fold change (MLE): condition_name IFNg vs naive
# Wald test p-value: condition name IFNg vs naive
# DataFrame with 6 rows and 7 columns
# baseMean log2FoldChange lfcSE stat pvalue
# <numeric> <numeric> <numeric> <numeric> <numeric>
# ENSG00000125347.13 30487.254 5.55915 0.218390 25.4551 6.19761e-143
# ENSG00000111181.12 687.519 4.70999 0.195911 24.0415 1.02514e-127
# ENSG00000162645.12 36639.987 6.66498 0.286603 23.2551 1.26442e-119
# ENSG00000137496.17 7118.885 4.05787 0.177049 22.9195 2.97302e-116
# ENSG00000145365.10 3642.657 5.19246 0.237141 21.8961 2.82829e-106
# ENSG00000204257.14 1906.261 4.07091 0.190575 21.3612 3.06785e-101
# padj symbol
# <numeric> <character>
# ENSG00000125347.13 1.43971e-138 IRF1
# ENSG00000111181.12 1.19070e-123 SLC6A12
# ENSG00000162645.12 9.79080e-116 GBP2
# ENSG00000137496.17 1.72658e-112 IL18BP
# ENSG00000145365.10 1.31402e-102 TIFA
# ENSG00000204257.14 1.18777e-97 HLA-DMA
resOrderedDF <- as.data.frame(resOrdered)[seq_len(100), ]
write.table(cbind(id = rownames(resOrderedDF), resOrderedDF),
file = "results.txt", quote = FALSE, sep = "\t",
row.names = FALSE)
In order to interpret the differential expression analysis results in terms of known gene sets, we can also apply a functional enrichment test. There are many alternatives to perform enrichment analysis in the context of R and Bioconductor. Among these, topGO and clusterProfiler are two popular options.
We can perform the analysis by following the instructions in the next chunk. In each case, the gene sets that are tested for enrichment are obtained from the Gene Ontology ‘biological process’ catalog.
library(GeneTonic)
de_symbols_IFNg_vs_naive <- deseqresult2df(res, FDR = 0.05)$symbol
bg_ids <- rowData(dds)$SYMBOL[rowSums(counts(dds)) > 0]
library(topGO)
topgo_DE_macrophage_IFNg_vs_naive <- pcaExplorer::topGOtable(
DEgenes = de_symbols_IFNg_vs_naive,
BGgenes = bg_ids,
ontology = "BP",
mapping = "org.Hs.eg.db",
geneID = "symbol",
topTablerows = 500
)
library(clusterProfiler)
clupro_DE_macrophage_IFNg_vs_naive <- clusterProfiler::enrichGO(
gene = de_symbols_IFNg_vs_naive,
universe = bg_ids,
keyType = "SYMBOL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = FALSE
)
“Since it is bioinformatics”, every software package can be expected to return a (slightly) different-but-similar-in-content output format. To simplify the interpretation of transcriptome datasets, the GeneTonic package offers an interactive application to explore in depth all the workflow results.
As a first step, we convert the output of each tool to a consolidated “standard” format, as expected by GeneTonic - this is the first step to construct a GeneTonicList object, as a single container to perform all operations on afterwards, be it in the app or offline by using its functionality in scripts/notebooks.
res_enrich_topGO <- shake_topGOtableResult(topgo_DE_macrophage_IFNg_vs_naive)
res_enrich_clupro <- shake_enrichResult(clupro_DE_macrophage_IFNg_vs_naive)
gtl_macrophage <- GeneTonicList(
dds = dds,
res_de = res,
res_enrich = res_enrich_clupro,
annotation_obj = data.frame(
gene_id = rowData(dds)$gene_id,
gene_name = rowData(dds)$SYMBOL
)
)
## we can store this object as serialized file to load/share/...
saveRDS(gtl_macrophage, "gtl_macrophage.RDS")
After that, we would simply have to call the GeneTonic() function specifying the gtl parameter - this can also be passed at runtime
## and that is it!
GeneTonic(gtl = gtl_macrophage)
## or if expecting to upload at runtime... (e.g. used as a server-like app)
GeneTonic()
Next, we perform a differential transcript expression analysis with swish. Note that, as opposed to the workflow above, we will make use of the transcript-level abundance estimates. In addition, we need the inferential replicates. Since we ignored these when importing the data above, we will re-import it here.
head(coldata)
# names sample_id line_id condition_name
# 1 SAMEA103885102 diku_A diku_1 naive
# 2 SAMEA103885347 diku_B diku_1 IFNg
# 3 SAMEA103885043 diku_C diku_1 SL1344
# 4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
# 5 SAMEA103885182 eiwy_A eiwy_1 naive
# 6 SAMEA103885136 eiwy_B eiwy_1 IFNg
# files
# 1 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
# 2 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
# 3 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
# 4 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
# 5 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
# 6 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
st <- tximeta(coldata = coldata, type = "salmon", dropInfReps = FALSE)
We can check that the inferential replicates were imported as well:
assayNames(st)
# [1] "counts" "abundance" "length" "infRep1" "infRep2" "infRep3"
# [7] "infRep4" "infRep5" "infRep6" "infRep7" "infRep8" "infRep9"
# [13] "infRep10" "infRep11" "infRep12" "infRep13" "infRep14" "infRep15"
# [19] "infRep16" "infRep17" "infRep18" "infRep19" "infRep20"
Since we are interested in comparing the naive and IFNg groups, we
subset our object to these groups.
st <- st[, st$condition_name %in% c("naive", "IFNg")]
st$condition_name <- factor(st$condition_name, c("naive", "IFNg"))
Next, we run the DTE analysis with swish. First we’ll scale the inferential replicates, followed by labeling the rows with sufficient counts for running differential expression, and then calculating the statistics.
library(fishpond)
st <- scaleInfReps(st, lengthCorrect = TRUE)
# Progress: 1 on 20 Progress: 2 on 20 Progress: 3 on 20 Progress: 4 on 20 Progress: 5 on 20 Progress: 6 on 20 Progress: 7 on 20 Progress: 8 on 20 Progress: 9 on 20 Progress: 10 on 20 Progress: 11 on 20 Progress: 12 on 20 Progress: 13 on 20 Progress: 14 on 20 Progress: 15 on 20 Progress: 16 on 20 Progress: 17 on 20 Progress: 18 on 20 Progress: 19 on 20 Progress: 20 on 20
st <- labelKeep(st)
st <- st[mcols(st)$keep, ]
set.seed(1)
st <- swish(st, x = "condition_name", pair = "line_id", nperms = 100)
# Progress: 1 on 64 Progress: 2 on 64 Progress: 3 on 64 Progress: 4 on 64 Progress: 5 on 64 Progress: 6 on 64 Progress: 7 on 64 Progress: 8 on 64 Progress: 9 on 64 Progress: 10 on 64 Progress: 11 on 64 Progress: 12 on 64 Progress: 13 on 64 Progress: 14 on 64 Progress: 15 on 64 Progress: 16 on 64 Progress: 17 on 64 Progress: 18 on 64 Progress: 19 on 64 Progress: 20 on 64 Progress: 21 on 64 Progress: 22 on 64 Progress: 23 on 64 Progress: 24 on 64 Progress: 25 on 64 Progress: 26 on 64 Progress: 27 on 64 Progress: 28 on 64 Progress: 29 on 64 Progress: 30 on 64 Progress: 31 on 64 Progress: 32 on 64 Progress: 33 on 64 Progress: 34 on 64 Progress: 35 on 64 Progress: 36 on 64 Progress: 37 on 64 Progress: 38 on 64 Progress: 39 on 64 Progress: 40 on 64 Progress: 41 on 64 Progress: 42 on 64 Progress: 43 on 64 Progress: 44 on 64 Progress: 45 on 64 Progress: 46 on 64 Progress: 47 on 64 Progress: 48 on 64 Progress: 49 on 64 Progress: 50 on 64 Progress: 51 on 64 Progress: 52 on 64 Progress: 53 on 64 Progress: 54 on 64 Progress: 55 on 64 Progress: 56 on 64 Progress: 57 on 64 Progress: 58 on 64 Progress: 59 on 64 Progress: 60 on 64 Progress: 61 on 64 Progress: 62 on 64 Progress: 63 on 64 Progress: 64 on 64
The results are stored in mcols(st).
head(mcols(st))
# DataFrame with 6 rows and 10 columns
# tx_id gene_id tx_name log10mean
# <integer> <CharacterList> <character> <numeric>
# ENST00000456328.2 1 ENSG00000223972.5 ENST00000456328.2 1.063836
# ENST00000488147.1 9483 ENSG00000227232.5 ENST00000488147.1 2.185746
# ENST00000461467.1 9486 ENSG00000237613.2 ENST00000461467.1 1.187334
# ENST00000466430.5 9487 ENSG00000238009.6 ENST00000466430.5 1.721425
# ENST00000495576.1 9488 ENSG00000239945.1 ENST00000495576.1 0.803498
# ENST00000442987.3 11 ENSG00000233750.3 ENST00000442987.3 0.954533
# keep stat log2FC pvalue locfdr qvalue
# <logical> <numeric> <numeric> <numeric> <numeric> <numeric>
# ENST00000456328.2 TRUE -6.00 0.0404840 0.4754832 1.000000 0.677493
# ENST00000488147.1 TRUE 5.30 0.0905183 0.5282657 1.000000 0.718250
# ENST00000461467.1 TRUE -6.60 -0.3041986 0.4326085 1.000000 0.642073
# ENST00000466430.5 TRUE -16.00 -0.3159068 0.0425972 0.444626 0.141361
# ENST00000495576.1 TRUE -14.35 -1.0201620 0.0755231 0.653396 0.209693
# ENST00000442987.3 TRUE 1.30 0.1165192 0.8732274 1.000000 0.934212
table(mcols(st)$qvalue < 0.05)
#
# FALSE TRUE
# 53833 12206
## Most significant transcripts
tophits <- mcols(st)[order(mcols(st)$qvalue, -abs(mcols(st)$log2FC)), ]
head(tophits)
# DataFrame with 6 rows and 10 columns
# tx_id gene_id tx_name log10mean
# <integer> <CharacterList> <character> <numeric>
# ENST00000264888.5 51518 ENSG00000138755.5 ENST00000264888.5 5.34048
# ENST00000370459.7 13420 ENSG00000154451.14 ENST00000370459.7 3.85225
# ENST00000522495.5 84040 ENSG00000131203.12 ENST00000522495.5 4.55934
# ENST00000355754.6 13414 ENSG00000162654.8 ENST00000355754.6 4.46060
# ENST00000554772.5 139645 ENSG00000140105.17 ENST00000554772.5 3.82585
# ENST00000306602.2 51519 ENSG00000169245.5 ENST00000306602.2 5.08257
# keep stat log2FC pvalue locfdr
# <logical> <numeric> <numeric> <numeric> <numeric>
# ENST00000264888.5 TRUE 21 10.73012 2.36603e-07 8.03099e-06
# ENST00000370459.7 TRUE 21 10.29332 2.36603e-07 8.03099e-06
# ENST00000522495.5 TRUE 21 9.88143 2.36603e-07 8.03099e-06
# ENST00000355754.6 TRUE 21 9.84063 2.36603e-07 8.03099e-06
# ENST00000554772.5 TRUE 21 9.74347 2.36603e-07 8.03099e-06
# ENST00000306602.2 TRUE 21 9.37261 2.36603e-07 8.03099e-06
# qvalue
# <numeric>
# ENST00000264888.5 2.6358e-06
# ENST00000370459.7 2.6358e-06
# ENST00000522495.5 2.6358e-06
# ENST00000355754.6 2.6358e-06
# ENST00000554772.5 2.6358e-06
# ENST00000306602.2 2.6358e-06
hist(mcols(st)$pvalue, col = "grey")
We can plot the results for some of the most significant transcripts.
plotInfReps(st, idx = rownames(tophits)[1],
x = "condition_name", cov = "line_id")
plotInfReps(st, idx = rownames(tophits)[43],
x = "condition_name", cov = "line_id")
plotInfReps(st, idx = rownames(tophits)[60],
x = "condition_name", cov = "line_id")
plotMASwish(st, alpha = 0.05)
We can also use the plotting functions of swish to plot the inferential replicates for the top-ranked genes in the differential gene expression analysis.
## Summarize on the gene level
sg_inf <- summarizeToGene(st)
sg_inf <- sg_inf[, sg_inf$condition_name %in% c("naive", "IFNg")]
sg_inf$condition_name <- factor(sg_inf$condition_name, c("naive", "IFNg"))
plotInfReps(sg_inf, idx = rownames(resOrdered)[1],
x = "condition_name", cov = "line_id")
Finally, we show how to run differential transcript usage analysis with swish and DEXSeq. With swish, we can build upon the data imported earlier to calculate isoform proportions and perfom a permutation test based on these.
iso <- isoformProportions(st)
# Progress: 1 on 20 Progress: 2 on 20 Progress: 3 on 20 Progress: 4 on 20 Progress: 5 on 20 Progress: 6 on 20 Progress: 7 on 20 Progress: 8 on 20 Progress: 9 on 20 Progress: 10 on 20 Progress: 11 on 20 Progress: 12 on 20 Progress: 13 on 20 Progress: 14 on 20 Progress: 15 on 20 Progress: 16 on 20 Progress: 17 on 20 Progress: 18 on 20 Progress: 19 on 20 Progress: 20 on 20
iso <- swish(iso, x = "condition_name", pair = "line_id",
nperms = 100)
# Progress: 1 on 64 Progress: 2 on 64 Progress: 3 on 64 Progress: 4 on 64 Progress: 5 on 64 Progress: 6 on 64 Progress: 7 on 64 Progress: 8 on 64 Progress: 9 on 64 Progress: 10 on 64 Progress: 11 on 64 Progress: 12 on 64 Progress: 13 on 64 Progress: 14 on 64 Progress: 15 on 64 Progress: 16 on 64 Progress: 17 on 64 Progress: 18 on 64 Progress: 19 on 64 Progress: 20 on 64 Progress: 21 on 64 Progress: 22 on 64 Progress: 23 on 64 Progress: 24 on 64 Progress: 25 on 64 Progress: 26 on 64 Progress: 27 on 64 Progress: 28 on 64 Progress: 29 on 64 Progress: 30 on 64 Progress: 31 on 64 Progress: 32 on 64 Progress: 33 on 64 Progress: 34 on 64 Progress: 35 on 64 Progress: 36 on 64 Progress: 37 on 64 Progress: 38 on 64 Progress: 39 on 64 Progress: 40 on 64 Progress: 41 on 64 Progress: 42 on 64 Progress: 43 on 64 Progress: 44 on 64 Progress: 45 on 64 Progress: 46 on 64 Progress: 47 on 64 Progress: 48 on 64 Progress: 49 on 64 Progress: 50 on 64 Progress: 51 on 64 Progress: 52 on 64 Progress: 53 on 64 Progress: 54 on 64 Progress: 55 on 64 Progress: 56 on 64 Progress: 57 on 64 Progress: 58 on 64 Progress: 59 on 64 Progress: 60 on 64 Progress: 61 on 64 Progress: 62 on 64 Progress: 63 on 64 Progress: 64 on 64
For DEXSeq, we will again re-import the data, since we would like the transcript counts to represent so called ‘scaled TPM’ values (similarly to what swish will do internally when scaling the inferential replicates, to avoid differences in transcript length being interpreted as differences in relative abundance between groups). tximeta can do this for us, effectively populating the ‘counts’ assay with TPMs, scaled up to the observed library size to be comparable in size to the actual counts. For DEXSeq, we further subset the transcripts to those on chromosome 18, for computational time reasons.
head(coldata)
# names sample_id line_id condition_name
# 1 SAMEA103885102 diku_A diku_1 naive
# 2 SAMEA103885347 diku_B diku_1 IFNg
# 3 SAMEA103885043 diku_C diku_1 SL1344
# 4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
# 5 SAMEA103885182 eiwy_A eiwy_1 naive
# 6 SAMEA103885136 eiwy_B eiwy_1 IFNg
# files
# 1 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
# 2 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
# 3 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
# 4 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
# 5 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
# 6 /Library/Frameworks/R.framework/Versions/4.2/Resources/library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
st <- tximeta(coldata = coldata, type = "salmon",
countsFromAbundance = "scaledTPM")
st <- st[, st$condition_name %in% c("naive", "IFNg")]
st$condition_name <- factor(st$condition_name, c("naive", "IFNg"))
st$sample_id <- colnames(st)
st <- st[seqnames(rowRanges(st)) == "chr18", ]
library(DEXSeq)
dxd <- DEXSeqDataSet(countData = round(assay(st, "counts")),
sampleData = as.data.frame(colData(st)),
design = ~sample + exon + condition_name:exon,
featureID = rowData(st)$tx_name,
groupID = as.character(rowData(st)$gene_id))
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd, quiet = TRUE)
dxd <- testForDEU(dxd, reducedModel = ~sample + exon)
dxr <- DEXSeqResults(dxd, independentFiltering = FALSE)
qval <- perGeneQValue(dxr)
dxr.g <- data.frame(gene = names(qval), qval)
head(dxr)
#
# LRT p-value: full vs reduced
#
# DataFrame with 6 rows and 9 columns
# groupID featureID
# <character> <character>
# ENSG00000262352.1:ENST00000575820.1 ENSG00000262352.1 ENST00000575820.1
# ENSG00000262352.1:ENST00000572573.1 ENSG00000262352.1 ENST00000572573.1
# ENSG00000263305.1:ENST00000572608.1 ENSG00000263305.1 ENST00000572608.1
# ENSG00000263305.1:ENST00000572062.1 ENSG00000263305.1 ENST00000572062.1
# ENSG00000262181.2:ENST00000575066.2 ENSG00000262181.2 ENST00000575066.2
# ENSG00000173213.9:ENST00000308911.8 ENSG00000173213.9 ENST00000308911.8
# exonBaseMean dispersion stat
# <numeric> <numeric> <numeric>
# ENSG00000262352.1:ENST00000575820.1 0.2673067 24 0.00882245
# ENSG00000262352.1:ENST00000572573.1 0.8479414 24 0.00881664
# ENSG00000263305.1:ENST00000572608.1 0.3218679 NA NA
# ENSG00000263305.1:ENST00000572062.1 0.0000000 NA NA
# ENSG00000262181.2:ENST00000575066.2 0.0000000 NA NA
# ENSG00000173213.9:ENST00000308911.8 0.0762727 24 0.07291985
# pvalue padj genomicData countData
# <numeric> <numeric> <GRangesList> <matrix>
# ENSG00000262352.1:ENST00000575820.1 0.925166 1 0:0:0:...
# ENSG00000262352.1:ENST00000572573.1 0.925191 1 0:2:0:...
# ENSG00000263305.1:ENST00000572608.1 NA NA 5:0:0:...
# ENSG00000263305.1:ENST00000572062.1 NA NA 0:0:0:...
# ENSG00000262181.2:ENST00000575066.2 NA NA 0:0:0:...
# ENSG00000173213.9:ENST00000308911.8 0.787132 1 0:0:1:...
Finally, we compare the p-value ranks for the genes shared by the two analyses.
shared <- intersect(dxr$featureID, rownames(mcols(iso)))
plot(rank(dxr[match(shared, dxr$featureID), "pvalue"]),
rank(mcols(iso)[shared, "pvalue"]), cex = 0.1,
xlab = "DEXSeq", ylab = "swish")
It is good practice to always include a list of the software versions that were used to perform a given analysis, for reproducibility and trouble-shooting purposes.
One way of achieving this is via the sessionInfo() function.
sessionInfo()
# R version 4.2.1 (2022-06-23)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Monterey 12.4
#
# Matrix products: default
# LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats4 stats graphics grDevices utils datasets methods
# [8] base
#
# other attached packages:
# [1] forcats_0.5.1 stringr_1.4.0
# [3] dplyr_1.0.9 purrr_0.3.4
# [5] readr_2.1.2 tidyr_1.2.0
# [7] tibble_3.1.7 tidyverse_1.3.1
# [9] ggthemes_4.2.4 gapminder_0.3.0
# [11] ggplot2_3.3.6 MASS_7.3-58
# [13] DEXSeq_1.42.0 RColorBrewer_1.1-3
# [15] BiocParallel_1.30.3 fishpond_2.2.0
# [17] clusterProfiler_4.4.4 topGO_2.48.0
# [19] SparseM_1.81 GO.db_3.15.0
# [21] graph_1.74.0 GeneTonic_2.1.2
# [23] iSEEu_1.8.0 iSEE_2.9.1
# [25] SingleCellExperiment_1.18.0 pheatmap_1.0.12
# [27] apeglm_1.18.0 pcaExplorer_2.23.0
# [29] bigmemory_4.6.1 edgeR_3.38.1
# [31] limma_3.52.2 ExploreModelMatrix_1.8.0
# [33] GenomicFeatures_1.48.3 org.Hs.eg.db_3.15.0
# [35] AnnotationDbi_1.58.0 DESeq2_1.36.0
# [37] SummarizedExperiment_1.26.1 Biobase_2.56.0
# [39] MatrixGenerics_1.8.1 matrixStats_0.62.0
# [41] GenomicRanges_1.48.0 GenomeInfoDb_1.32.2
# [43] IRanges_2.30.0 S4Vectors_0.34.0
# [45] BiocGenerics_0.42.0 tximeta_1.14.0
# [47] macrophage_1.12.0 rmarkdown_2.14
# [49] knitr_1.39 BiocStyle_2.24.0
# [51] BiocManager_1.30.18
#
# loaded via a namespace (and not attached):
# [1] icons_0.2.0 ps_1.7.1
# [3] Rsamtools_2.12.0 foreach_1.5.2
# [5] rprojroot_2.0.3 crayon_1.5.1
# [7] backports_1.4.1 nlme_3.1-158
# [9] reprex_2.0.1 colourpicker_1.1.1
# [11] GOSemSim_2.22.0 rlang_1.0.4
# [13] readxl_1.4.0 XVector_0.36.0
# [15] fontawesome_0.2.2 callr_3.7.1
# [17] filelock_1.0.2 GOstats_2.62.0
# [19] rjson_0.2.21 xaringanExtra_0.6.0
# [21] bit64_4.0.5 glue_1.6.2
# [23] rngtools_1.5.2 parallel_4.2.1
# [25] processx_3.7.0 vipor_0.4.5
# [27] shinyAce_0.4.2 shinydashboard_0.7.2
# [29] DOSE_3.22.0 haven_2.5.0
# [31] tidyselect_1.1.2 XML_3.99-0.10
# [33] GenomicAlignments_1.32.0 xtable_1.8-4
# [35] magrittr_2.0.3 evaluate_0.15
# [37] cli_3.3.0 zlibbioc_1.42.0
# [39] hwriter_1.3.2.1 rstudioapi_0.13
# [41] miniUI_0.1.1.1 bslib_0.3.1
# [43] fastmatch_1.1-3 ensembldb_2.20.2
# [45] treeio_1.20.1 shiny_1.7.1
# [47] xfun_0.31 clue_0.3-61
# [49] pkgbuild_1.3.1 cluster_2.1.3
# [51] tidygraph_1.2.1 TSP_1.2-1
# [53] KEGGREST_1.36.3 interactiveDisplayBase_1.34.0
# [55] expm_0.999-6 ggrepel_0.9.1
# [57] threejs_0.3.3 ape_5.6-2
# [59] dendextend_1.16.0 shinyWidgets_0.7.1
# [61] Biostrings_2.64.0 png_0.1-7
# [63] withr_2.5.0 shinyBS_0.61.1
# [65] bitops_1.0-7 ggforce_0.3.3
# [67] cellranger_1.1.0 RBGL_1.72.0
# [69] plyr_1.8.7 GSEABase_1.58.0
# [71] AnnotationFilter_1.20.0 coda_0.19-4
# [73] xaringan_0.25 pillar_1.7.0
# [75] GlobalOptions_0.1.2 cachem_1.0.6
# [77] fs_1.5.2 GetoptLong_1.0.5
# [79] vctrs_0.4.1 ellipsis_0.3.2
# [81] generics_0.1.3 NMF_0.24.0
# [83] tools_4.2.1 archive_1.1.5
# [85] munsell_0.5.0 tweenr_1.0.2
# [87] fgsea_1.22.0 DelayedArray_0.22.0
# [89] abind_1.4-5 fastmap_1.1.0
# [91] compiler_4.2.1 httpuv_1.6.5
# [93] rtracklayer_1.56.1 pkgmaker_0.32.2
# [95] plotly_4.10.0 GenomeInfoDbData_1.2.8
# [97] gridExtra_2.3 emo_0.0.0.9000
# [99] lattice_0.20-45 visNetwork_2.1.0
# [101] AnnotationForge_1.38.0 utf8_1.2.2
# [103] later_1.3.0 BiocFileCache_2.4.0
# [105] jsonlite_1.8.0 scales_1.2.0
# [107] tidytree_0.3.9 genefilter_1.78.0
# [109] lazyeval_0.2.2 tippy_0.1.0
# [111] promises_1.2.0.1 doParallel_1.0.17
# [113] cowplot_1.1.1 statmod_1.4.36
# [115] webshot_0.5.3 downloader_0.4
# [117] igraph_1.3.2 survival_3.3-1
# [119] numDeriv_2016.8-1.1 rsconnect_0.8.27
# [121] yaml_2.3.5 rintrojs_0.3.0
# [123] htmltools_0.5.2 memoise_2.0.1
# [125] BiocIO_1.6.0 locfit_1.5-9.6
# [127] seriation_1.3.5 graphlayouts_0.8.0
# [129] viridisLite_0.4.0 digest_0.6.29
# [131] assertthat_0.2.1 mime_0.12
# [133] rappdirs_0.3.3 emdbook_1.3.12
# [135] registry_0.5-1 bigmemory.sri_0.1.3
# [137] RSQLite_2.2.14 yulab.utils_0.0.5
# [139] remotes_2.4.2 data.table_1.14.2
# [141] blob_1.2.3 splines_4.2.1
# [143] labeling_0.4.2 AnnotationHub_3.4.0
# [145] ProtGenerics_1.28.0 RCurl_1.98-1.7
# [147] broom_1.0.0 hms_1.1.1
# [149] modelr_0.1.8 colorspace_2.0-3
# [151] base64enc_0.1-3 shape_1.4.6
# [153] aplot_0.1.6 tximport_1.24.0
# [155] sass_0.4.1 Rcpp_1.0.9
# [157] bookdown_0.27 mvtnorm_1.1-3
# [159] circlize_0.4.15 enrichplot_1.16.1
# [161] backbone_2.1.0 fansi_1.0.3
# [163] tzdb_0.3.0 R6_2.5.1
# [165] grid_4.2.1 lifecycle_1.0.1
# [167] formatR_1.12 curl_4.3.2
# [169] ComplexUpset_1.3.3 jquerylib_0.1.4
# [171] svMisc_1.2.3 DO.db_2.9
# [173] Matrix_1.4-1 qvalue_2.28.0
# [175] iterators_1.0.14 htmlwidgets_1.5.4
# [177] polyclip_1.10-0 biomaRt_2.52.0
# [179] crosstalk_1.2.0 shadowtext_0.1.2
# [181] gridGraphics_0.5-1 rvest_1.0.2
# [183] ComplexHeatmap_2.12.0 mgcv_1.8-40
# [185] patchwork_1.1.1 bdsmatrix_1.3-6
# [187] lubridate_1.8.0 codetools_0.2-18
# [189] bs4Dash_2.1.0 gtools_3.9.3
# [191] prettyunits_1.1.1 dbplyr_2.2.1
# [193] gridBase_0.4-7 gtable_0.3.0
# [195] DBI_1.1.3 dynamicTreeCut_1.63-1
# [197] highr_0.9 ggfun_0.0.6
# [199] httr_1.4.3 stringi_1.7.8
# [201] vroom_1.5.7 progress_1.2.2
# [203] reshape2_1.4.4 farver_2.1.1
# [205] uuid_1.1-0 heatmaply_1.3.0
# [207] annotate_1.74.0 viridis_0.6.2
# [209] Rgraphviz_2.40.0 ggtree_3.4.1
# [211] DT_0.23 xml2_1.3.3
# [213] bbmle_1.0.25 shinyjs_2.1.0
# [215] restfulr_0.0.15 geneplotter_1.74.0
# [217] ggplotify_0.1.0 Category_2.62.0
# [219] BiocVersion_3.15.2 bit_4.0.4
# [221] shinycssloaders_1.0.0 scatterpie_0.1.7
# [223] ggraph_2.0.5 pkgconfig_2.0.3